The Conversation: Deep Audio-Visual Speech Enhancement
Our goal is to isolate individual speakers from multi-talker simultaneous
speech in videos. Existing works in this area have focussed on trying to
separate utterances from known speakers in controlled environments. In this
paper, we propose a deep audio-visual speech enhancement network that is able
to separate a speaker's voice given lip regions in the corresponding video, by
predicting both the magnitude and the phase of the target signal. The method is
applicable to speakers unheard and unseen during training, and for
unconstrained environments. We demonstrate strong quantitative and qualitative
results, isolating extremely challenging real-world examples.
Comment: To appear in Interspeech 2018. We provide supplementary material with
interactive demonstrations on
http://www.robots.ox.ac.uk/~vgg/demo/theconversatio
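The magnitude-and-phase prediction described above can be pictured with a short sketch. The module below is a minimal illustration, not the paper's architecture: the layer choices, dimensions, and names (AVEnhancer, lip_dim, and so on) are assumptions for the example.

```python
# Minimal sketch: predict a magnitude mask and a phase correction for the
# target speaker, conditioned on lip-region features (all names assumed).
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    def __init__(self, freq_bins=257, lip_dim=512, hidden=400):
        super().__init__()
        # fuse the noisy magnitude spectrogram with the visual stream
        self.fuse = nn.LSTM(freq_bins + lip_dim, hidden, batch_first=True)
        self.mag_mask = nn.Linear(hidden, freq_bins)   # soft magnitude mask
        self.phase_res = nn.Linear(hidden, freq_bins)  # phase correction

    def forward(self, mix_mag, mix_phase, lip_feats):
        # mix_mag, mix_phase: (B, T, F); lip_feats: (B, T, lip_dim),
        # with the video stream upsampled to the spectrogram frame rate
        h, _ = self.fuse(torch.cat([mix_mag, lip_feats], dim=-1))
        mag = torch.sigmoid(self.mag_mask(h)) * mix_mag   # masked magnitude
        phase = mix_phase + torch.pi * torch.tanh(self.phase_res(h))
        return mag, phase  # enhanced magnitude and corrected phase

model = AVEnhancer()
mag, phase = model(torch.rand(2, 100, 257), torch.rand(2, 100, 257),
                   torch.rand(2, 100, 512))
```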
Learning to Ground Instructional Articles in Videos through Narrations
In this paper we present an approach for localizing steps of procedural
activities in narrated how-to videos. To deal with the scarcity of labeled data
at scale, we source the step descriptions from a language knowledge base
(wikiHow) containing instructional articles for a large variety of procedural
tasks. Without any form of manual supervision, our model learns to temporally
ground the steps of procedural articles in how-to videos by matching three
modalities: frames, narrations, and step descriptions. Specifically, our method
aligns steps to video by fusing information from two distinct pathways: (i)
direct alignment of step descriptions to frames, and (ii) indirect alignment
obtained by composing steps-to-narrations with narrations-to-video
correspondences. Notably, our approach performs global temporal grounding of
all steps in an article at once by exploiting order information, and is trained
with step pseudo-labels which are iteratively refined and aggressively
filtered. In order to validate our model we introduce a new evaluation
benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of
HowTo100M (a test server is accessible at
https://eval.ai/web/challenges/challenge-page/2082) with steps sourced
from wikiHow articles. Experiments on this benchmark as well as zero-shot
evaluations on CrossTask demonstrate that our multi-modality alignment yields
dramatic gains over several baselines and prior works. Finally, we show that
our inner module for narration-to-video matching outperforms the state of the
art on the HTM-Align narration-video alignment benchmark by a large margin.
Comment: 17 pages, 4 figures and 10 tables
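The two-pathway fusion can be sketched directly on embedding matrices. The snippet below is a minimal illustration under assumed shapes, with a simple averaged fusion standing in for the learned pathway; the function name align_steps and the temperature are invented for the example.

```python
# Minimal sketch: fuse direct step-to-frame alignment with an indirect
# pathway composed from step-to-narration and narration-to-frame scores.
import torch
import torch.nn.functional as F

def align_steps(steps, narrs, frames, temp=0.07):
    # steps: (K, d), narrs: (M, d), frames: (T, d), all L2-normalized
    direct = steps @ frames.T                       # (K, T) step-frame scores
    s2n = F.softmax(steps @ narrs.T / temp, dim=1)  # (K, M) step->narration
    n2v = narrs @ frames.T                          # (M, T) narration-frame
    indirect = s2n @ n2v                            # (K, T) composed pathway
    return 0.5 * (direct + indirect)                # fused step-to-frame map

K, M, T, d = 8, 20, 300, 256
emb = lambda n: F.normalize(torch.randn(n, d), dim=-1)
scores = align_steps(emb(K), emb(M), emb(T))  # (8, 300)
```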
Counterfactual Multi-Agent Policy Gradients
Cooperative multi-agent systems can be naturally used to model many real
world problems, such as network packet routing and the coordination of
autonomous vehicles. There is a great need for new reinforcement learning
methods that can efficiently learn decentralised policies for such systems. To
this end, we propose a new multi-agent actor-critic method called
counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised
critic to estimate the Q-function and decentralised actors to optimise the
agents' policies. In addition, to address the challenges of multi-agent credit
assignment, it uses a counterfactual baseline that marginalises out a single
agent's action, while keeping the other agents' actions fixed. COMA also uses a
critic representation that allows the counterfactual baseline to be computed
efficiently in a single forward pass. We evaluate COMA in the testbed of
StarCraft unit micromanagement, using a decentralised variant with significant
partial observability. COMA significantly improves average performance over
other multi-agent actor-critic methods in this setting, and the best performing
agents are competitive with state-of-the-art centralised controllers that get
access to the full state.
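The counterfactual baseline has a compact form: the advantage for an agent subtracts, from the Q-value of the joint action taken, the expectation of Q over that agent's own actions under its policy, with the other agents' actions held fixed. A minimal sketch, assuming the centralised critic has already produced per-action Q-values for one agent in a single forward pass (tensor shapes are illustrative):

```python
# Minimal sketch: COMA-style counterfactual advantage for agent a.
import torch

def counterfactual_advantage(q_all, pi_a, u_a):
    # q_all: (B, n_actions) critic Q-values for each action of agent a,
    #        with the other agents' actions held fixed
    # pi_a:  (B, n_actions) agent a's policy probabilities
    # u_a:   (B,) actions actually taken by agent a
    q_taken = q_all.gather(1, u_a.unsqueeze(1)).squeeze(1)  # Q(s, u)
    baseline = (pi_a * q_all).sum(dim=1)  # marginalise out agent a's action
    return q_taken - baseline             # advantage used in the policy gradient

q = torch.randn(4, 6)
pi = torch.softmax(torch.randn(4, 6), dim=1)
adv = counterfactual_advantage(q, pi, torch.randint(0, 6, (4,)))
```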
My lips are concealed: Audio-visual speech enhancement through obstructions
Our objective is an audio-visual model for separating a single speaker from a
mixture of sounds such as other speakers and background noise. Moreover, we
wish to hear the speaker even when the visual cues are temporarily absent due
to occlusion. To this end we introduce a deep audio-visual speech enhancement
network that is able to separate a speaker's voice by conditioning on both the
speaker's lip movements and/or a representation of their voice. The voice
representation can be obtained by either (i) enrollment, or (ii) by
self-enrollment -- learning the representation on-the-fly given sufficient
unobstructed visual input. The model is trained by blending audio tracks, and by
introducing artificial occlusions around the mouth region that prevent the
visual modality from dominating. The method is speaker-independent, and we
demonstrate it on real examples of speakers unheard (and unseen) during
training. The method also improves over previous models, in particular for
cases of occlusion in the visual modality.
Comment: Accepted to Interspeech 2019
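The occlusion augmentation that keeps the visual modality from dominating can be sketched as a simple corruption of the lip-feature stream during training. This is a minimal illustration under assumed shapes; the function name, span lengths, and rates are invented for the example.

```python
# Minimal sketch: randomly zero out temporal spans of the lip-feature
# stream to simulate an occluded mouth region (parameters assumed).
import torch

def occlude_lip_stream(lip_feats, p=0.5, max_span=25):
    # lip_feats: (B, T, D) per-frame visual features
    out = lip_feats.clone()
    B, T, _ = out.shape
    for b in range(B):
        if torch.rand(1).item() < p:
            span = int(torch.randint(1, max_span + 1, (1,)).item())
            start = int(torch.randint(0, max(T - span, 1), (1,)).item())
            out[b, start:start + span] = 0.0  # occluded segment
    return out

aug = occlude_lip_stream(torch.randn(2, 100, 512))
```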
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition
without requiring human annotated ground truth data. We achieve this by
distilling from an Automatic Speech Recognition (ASR) model that has been
trained on a large-scale audio-only corpus. We use a cross-modal distillation
method that combines Connectionist Temporal Classification (CTC) with a
frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that
ground truth transcriptions are not necessary to train a lip reading system;
(ii) we show how arbitrary amounts of unlabelled video data can be leveraged to
improve performance; (iii) we demonstrate that distillation significantly
speeds up training; and, (iv) we obtain state-of-the-art results on the
challenging LRS2 and LRS3 datasets for training only on publicly available
data.
Comment: ICASSP 2020
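The combination of CTC with a frame-wise loss can be sketched as a single weighted objective. Below is a minimal illustration, assuming the ASR teacher supplies both a decoded transcription (for the CTC term) and per-frame posteriors (matched here with a KL term); the weighting alpha and all shapes are invented for the example.

```python
# Minimal sketch: cross-modal distillation loss = CTC on teacher
# transcriptions + frame-wise matching of teacher posteriors.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logp, pseudo_targets,
                 in_lens, tgt_lens, alpha=0.5):
    # student_logits: (T, B, C); teacher_logp: (T, B, C) log-posteriors
    # pseudo_targets: (B, S) token ids decoded from the ASR teacher
    logp = F.log_softmax(student_logits, dim=-1)
    ctc = F.ctc_loss(logp, pseudo_targets, in_lens, tgt_lens, blank=0)
    frame_ce = F.kl_div(logp, teacher_logp, log_target=True,
                        reduction="batchmean")
    return alpha * ctc + (1 - alpha) * frame_ce

T, B, C, S = 50, 2, 40, 12
loss = distill_loss(torch.randn(T, B, C),
                    F.log_softmax(torch.randn(T, B, C), dim=-1),
                    torch.randint(1, C, (B, S)),
                    torch.full((B,), T, dtype=torch.long),
                    torch.full((B,), S, dtype=torch.long))
```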
Deep Lip Reading: a comparison of models and an online application
The goal of this paper is to develop state-of-the-art models for lip reading
-- visual speech recognition. We develop three architectures and compare their
accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully
convolutional model; and (iii) the recently proposed transformer model. The
recurrent and fully convolutional models are trained with a Connectionist
Temporal Classification loss and use an explicit language model for decoding,
while the transformer is a sequence-to-sequence model. Our best-performing model
improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip
Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent.
As a further contribution we investigate the fully convolutional model when
used for online (real time) lip reading of continuous speech, and show that it
achieves high performance with low latency.
Comment: To appear in Interspeech 2018
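For the CTC-trained variants, decoding collapses repeated frame-level predictions and removes blanks; the paper pairs this with an explicit language model, but the plain best-path version below shows the core step. A minimal sketch with illustrative shapes:

```python
# Minimal sketch: greedy (best-path) CTC decoding -- collapse repeats,
# then drop blank symbols.
import torch

def ctc_greedy_decode(logits, blank=0):
    # logits: (T, C) per-frame scores over output characters
    path = logits.argmax(dim=-1).tolist()
    out, prev = [], blank
    for p in path:
        if p != blank and p != prev:  # collapse repeats, drop blanks
            out.append(p)
        prev = p
    return out

tokens = ctc_greedy_decode(torch.randn(75, 40))
```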
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms
of a broader task, where multiple keysteps are performed in sequence across a
long video to reach a final goal state -- such as the steps of a recipe or a
DIY fix-it task. Prior work largely treats keystep recognition in isolation
from this broader structure, or else rigidly confines keysteps to align with a
predefined sequential script. We propose discovering a task graph automatically
from how-to videos to represent probabilistically how people tend to execute
keysteps, and then leverage this graph to regularize keystep recognition in
novel videos. On multiple datasets of real-world instructional videos, we show
the impact: more reliable zero-shot keystep localization and improved video
representation learning, exceeding the state of the art.
Comment: Technical Report
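One way to picture the idea: estimate keystep transition probabilities from many videos, then use them to rescore a base model's per-frame predictions. The sketch below is a minimal illustration, not the paper's method; the smoothing, the Viterbi-style pass, and all names are assumptions.

```python
# Minimal sketch: mine a transition-probability task graph from keystep
# sequences, then regularize per-frame keystep predictions with it.
import numpy as np

def mine_task_graph(sequences, n_steps, eps=1e-2):
    trans = np.full((n_steps, n_steps), eps)    # smoothed transition counts
    for seq in sequences:                        # observed keystep orderings
        for a, b in zip(seq[:-1], seq[1:]):
            trans[a, b] += 1.0
    return trans / trans.sum(axis=1, keepdims=True)

def rescore(frame_probs, trans):
    # frame_probs: (T, K) per-frame keystep likelihoods from a base model
    T, K = frame_probs.shape
    score = np.log(frame_probs[0] + 1e-9)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):                        # Viterbi-style forward pass
        cand = score[:, None] + np.log(trans)    # (K, K): prev -> next
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(frame_probs[t] + 1e-9)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]                            # regularised keystep labels

graph = mine_task_graph([[0, 1, 2], [0, 2, 1], [0, 1, 2]], n_steps=3)
labels = rescore(np.random.dirichlet(np.ones(3), size=10), graph)
```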
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a word
of interest is spoken by a talking face, with or without the audio. We propose
a zero-shot method suitable for in-the-wild videos. Our key contributions are:
(1) a novel convolutional architecture, KWS-Net, that uses a similarity map
intermediate representation to separate the task into (i) sequence matching,
and (ii) pattern detection, to decide whether the word is there and when; (2)
we demonstrate that when audio is available, visual keyword spotting improves
performance for both clean and noisy audio signals. Finally, (3) we show that
our method generalises to other languages, specifically French and German, and
achieves performance comparable to English with less language-specific data by
fine-tuning a network pre-trained on English. The method exceeds the
performance of the previous state-of-the-art visual keyword spotting
architecture when trained and tested on the same benchmark, and also that of a
state-of-the-art lip reading method.
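The similarity-map intermediate representation can be sketched as a dot-product grid between keyword and video embeddings, scanned by a small detector. This is a minimal illustration under assumed shapes; SimMapSpotter and its layers are invented for the example and are not the KWS-Net architecture.

```python
# Minimal sketch: build a keyword-vs-frame similarity map, then run a
# small convolutional detector over it to decide if the word is present.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimMapSpotter(nn.Module):
    def __init__(self):
        super().__init__()
        self.detector = nn.Sequential(        # pattern detection on the map
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, kw_emb, vid_emb):
        # kw_emb: (B, P, d) keyword phoneme embeddings
        # vid_emb: (B, T, d) per-frame visual embeddings
        sim = torch.einsum("bpd,btd->bpt",
                           F.normalize(kw_emb, dim=-1),
                           F.normalize(vid_emb, dim=-1))  # similarity map
        return self.detector(sim.unsqueeze(1))            # presence score

model = SimMapSpotter()
score = model(torch.randn(2, 8, 256), torch.randn(2, 100, 256))
```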
Watch, read and lookup: learning to spot signs from multiple supervisors
The focus of this work is sign spotting - given a video of an isolated sign,
our task is to identify whether and where it has been signed in a continuous,
co-articulated sign language video. To achieve this sign spotting task, we
train a model using multiple types of available supervision by: (1) watching
existing sparsely labelled footage; (2) reading associated subtitles (readily
available translations of the signed content) which provide additional
weak-supervision; (3) looking up words (for which no co-articulated labelled
examples are available) in visual sign language dictionaries to enable novel
sign spotting. These three tasks are integrated into a unified learning
framework using the principles of Noise Contrastive Estimation and Multiple
Instance Learning. We validate the effectiveness of our approach on low-shot
sign spotting benchmarks. In addition, we contribute a machine-readable British
Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to
facilitate study of this task. The dataset, models and code are available at
our project page.
Comment: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) -
Oral presentation. 29 pages
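The combination of Noise Contrastive Estimation with Multiple Instance Learning is commonly instantiated as a MIL-NCE style loss over bags of candidate matches. The sketch below is a generic illustration of that principle under assumed shapes, not the paper's exact objective.

```python
# Minimal sketch: MIL-NCE style loss -- pool evidence over a bag of
# positive candidates and contrast it against sampled negatives.
import torch
import torch.nn.functional as F

def mil_nce(query, pos_bag, neg_bag, temp=0.07):
    # query: (B, d); pos_bag: (B, P, d); neg_bag: (B, N, d), L2-normalized
    pos = torch.einsum("bd,bpd->bp", query, pos_bag) / temp
    neg = torch.einsum("bd,bnd->bn", query, neg_bag) / temp
    num = torch.logsumexp(pos, dim=1)             # evidence over the bag
    den = torch.logsumexp(torch.cat([pos, neg], dim=1), dim=1)
    return (den - num).mean()

B, P, N, d = 4, 5, 20, 256
norm = lambda x: F.normalize(x, dim=-1)
loss = mil_nce(norm(torch.randn(B, d)), norm(torch.randn(B, P, d)),
               norm(torch.randn(B, N, d)))
```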